{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [kaggle][学习向]sf-crime数据的清洗及格式化\n",
    "\n",
    "在上一篇[kaggle][学习向]sf-crime数据的可视化分析中，我们选取出了12个特征，但是数据仍然存放在csv表格中，我们将在这一篇中将数据进行清洗、格式化并存储成numpy易读取的模式，便于下一步的处理，下面开始编写程序，如果有不明确的地方，请参考上一篇\n",
    "\n",
    "导入所需要使用的库："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.decomposition import PCA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "读取train.csv和test.csv两个表格："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data = pd.read_csv('../input/sf-crime/train.csv.zip', parse_dates=['Dates'])\n",
    "test_data = pd.read_csv('../input/sf-crime/test.csv.zip', parse_dates=['Dates'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将两个表格进行拼接，并抛弃训练集的Descript、Resolution两列，测试集的Id一列："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 1762311 entries, 0 to 884261\n",
      "Data columns (total 6 columns):\n",
      "Dates         datetime64[ns]\n",
      "DayOfWeek     object\n",
      "PdDistrict    object\n",
      "Address       object\n",
      "X             float64\n",
      "Y             float64\n",
      "dtypes: datetime64[ns](1), float64(2), object(3)\n",
      "memory usage: 94.1+ MB\n"
     ]
    }
   ],
   "source": [
    "all_features = pd.concat((train_data.iloc[:, [0, 3, 4, 6, 7, 8]],\n",
    "                          test_data.iloc[:, [1, 2, 3, 4, 5, 6]]),\n",
    "                         sort=False)\n",
    "all_features.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "记录下训练集的行数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "878049\n"
     ]
    }
   ],
   "source": [
    "num_train = train_data.shape[0]\n",
    "print(num_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "生成训练集标签train_labels："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(878049,)\n",
      "[37 21 21 16 16 16 36 36 16 16]\n"
     ]
    }
   ],
   "source": [
    "train_labels = pd.get_dummies(train_data['Category']).values\n",
    "np.save(\"./data/train_labels_onehot.npy\", train_labels)\n",
    "num_outputs = train_labels.shape[1]\n",
    "train_labels = np.argmax(train_labels, axis=1)\n",
    "np.save(\"./data/train_labels.npy\", train_labels)\n",
    "#PS：存储两个版本的train_labels\n",
    "print(train_labels.shape)\n",
    "print(train_labels[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dates列的数据包含年份、月份、新年（是否是1月、2月）、天、小时、黑夜（是否是18点之后）六个特征，分别进行处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_features['year'] = all_features.Dates.dt.year\n",
    "all_features['month'] = all_features.Dates.dt.month\n",
    "all_features['new_year'] = all_features['month'].apply(\n",
    "    lambda x: 1 if x == 1 or x == 2 else 0)\n",
    "all_features['day'] = all_features.Dates.dt.day\n",
    "all_features['hour'] = all_features.Dates.dt.hour\n",
    "all_features['evening'] = all_features['hour'].apply(lambda x: 1\n",
    "                                                     if x >= 18 else 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "处理DayOfWeek列数据，得到星期几和周末（是否是周五、周六）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "wkm = {\n",
    "    'Monday': 0,\n",
    "    'Tuesday': 1,\n",
    "    'Wednesday': 2,\n",
    "    'Thursday': 3,\n",
    "    'Friday': 4,\n",
    "    'Saturday': 5,\n",
    "    'Sunday': 6\n",
    "}\n",
    "all_features['DayOfWeek'] = all_features['DayOfWeek'].apply(lambda x: wkm[x])\n",
    "all_features['weekend'] = all_features['DayOfWeek'].apply(\n",
    "    lambda x: 1 if x == 4 or x == 5 else 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "独热编码（One-Hot Encoding），又称一位有效编码，其方法是使用N位状态寄存器来对N个状态进行编码，每个状态都有它独立的寄存器位，并且在任意时候，其中只有一位有效。即，只有一位是1，其余都是零值。\n",
    "\n",
    "PdDistrict包含辖区的数据，我们选择用独热编码的方法进行处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "OneHot_features = pd.get_dummies(all_features['PdDistrict'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提取出Address列中街区（是否存在block）的特征："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_features['block'] = all_features['Address'].apply(\n",
    "    lambda x: 1 if 'block' in x.lower() else 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "按照使用算法的区别，将all_features一分为三："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "PCA_features = all_features[['X', 'Y']].values\n",
    "Standard_features = all_features[['DayOfWeek', 'year', 'month', 'day',\n",
    "                                  'hour']].values\n",
    "OneHot_features = pd.concat([\n",
    "    OneHot_features, all_features[['new_year', 'evening', 'weekend', 'block']]\n",
    "],\n",
    "                            axis=1).values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "特征缩放是用来统一资料中的自变项或特征范围的方法，我们采用特征缩放中标准化的方法对DayOfWeek、year、month、day、hour五列进行处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler = StandardScaler()\n",
    "scaler.fit(Standard_features)\n",
    "Standard_features = scaler.transform(Standard_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "主成分分析（Principal Component Analysis，PCA）， 是一种统计方法。通过正交变换将一组可能存在相关性的变量转换为一组线性不相关的变量。\n",
    "\n",
    "我们既想保留X、Y的特征，又想适当削弱X、Y的权重，可以选择对X、Y两列进行主成分分析："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca = PCA(n_components=2)\n",
    "pca.fit(PCA_features)\n",
    "PCA_features = pca.transform(PCA_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将独热编码的PdDistrict、new_year、evening、weekend、block五部分与其余部分进行拼接，重新得到all_features（总计12个特征，21列）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_features = np.concatenate(\n",
    "    (PCA_features, Standard_features, OneHot_features), axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将all_features一分为二，得到处理好的训练集特征train_features和测试集特征test_features，以及网络输入层节点数num_inputs："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_features = all_features[:num_train]\n",
    "num_inputs = train_features.shape[1]\n",
    "test_features = all_features[num_train:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "查看训练集特征train_features、训练集标签train_labels、网络输入层节点数num_inputs、测试集特征test_features、网络输出层节点数num_outputs："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "训练集特征储存于src/data文件夹下train_features\n",
      "train_features共878049行，21列\n",
      "训练集标签储存于src/data文件夹下train_labels\n",
      "train_labels共878049行\n",
      "测试集特征储存于src/data文件夹下test_features\n",
      "test_features共884262行，21列\n",
      "输入层节点数为21\n",
      "输出层节点数为39\n"
     ]
    }
   ],
   "source": [
    "np.save(\"./data/train_features.npy\", train_features)\n",
    "print('训练集特征储存于src/data文件夹下train_features')\n",
    "print('train_features共%d行，%d列' %\n",
    "      (train_features.shape[0], train_features.shape[1]))\n",
    "print('训练集标签储存于src/data文件夹下train_labels')\n",
    "print('train_labels共%d行' % (train_labels.shape[0]))\n",
    "np.save(\"./data/test_features.npy\", (test_features))\n",
    "print('测试集特征储存于src/data文件夹下test_features')\n",
    "print('test_features共%d行，%d列' %\n",
    "      (test_features.shape[0], test_features.shape[1]))\n",
    "print('输入层节点数为%d' % (num_inputs))\n",
    "print('输出层节点数为%d' % (num_outputs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看出，训练集特征共21列（11个特征各1列，辖区独热编码占10列），878049行，即：有878049个样本供训练。测试集有884262个样本，需要计算这些样本39种犯罪类型各种类型的可能性（39种总计100%）。\n",
    "\n",
    "至此，模型相关的数据处理基本完成，可以开始后面的步骤。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6.10 64-bit ('Pytorch-learn': conda)",
   "language": "python",
   "name": "python361064bitpytorchlearnconda080df47efea24539a61202fa66a72562"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
